Introduction. Estimation of covariance matrices in various norms is a critical issue that 
finds applications in a wide range of statistical problems, and especially in principal component 
analysis. It is well known that, without further assumptions, the empirical covariance matrix $\Sigma^*$ 
is the best possible estimator in many ways, and in particular in a minimax sense. However, it is 
also well known that $\Sigma^*$ is not an accurate estimator when the dimension $p$ of the observations 
is high. The minimax analysis carried out by Tony Cai and Harry Zhou ([CZ] in what follows) 
guarantees that for several classes of matrices with reasonable structure (sparse or banded 
matrices), the fully data-driven thresholding estimator achieves the best possible rates when $p$ is 
much larger than the sample size $n$. This is done, in particular, by proving minimax lower 
bounds that ensure that no estimator can perform better than the hard thresholding estimator, 
uniformly over the sparsity classes $\mathcal{G}_q$ for each $0 \le q < 1$. This result has a flavor of universality 
in the sense that one and the same estimator is minimax optimal for several classes of matrices. 

Our comments focus on the sparsity classes of matrices. 

(a) Optimal rates. Optimal rates are obtained in [CZ] under the assumption that the dimension 
is very high: $p \ge n^{\nu}$, $\nu > 1$. Thus, the case of dimensions smaller than $n$, or even $p \sim n$, is 
excluded. This seems to be due to the technique of proving the lower bound (Theorem 2 in 
[CZ]). Indeed, by a different technique, we show that the lower bound holds without this 
assumption, cf. Theorem 1 below. Furthermore, in general, our lower rate $\psi^{(1)}$ is different 
from that obtained in [CZ] and has ingredients similar to the optimal rate for the Gaussian 
sequence model. We conjecture that it is optimal for all admissible configurations of $n$, $p$, 
and sparsity parameters. 

(b) Frobenius norm and global sparsity. We argue that the Frobenius norm is naturally adapted 
to the structure of the problem, at least for Gaussian observations, and we derive optimal 
rates under the Frobenius risk and global sparsity assumption. 

(c) Approximate sparsity. Again under the Frobenius risk, one can obtain not only the 
minimax results but also oracle inequalities. We demonstrate it for the soft-thresholding 
estimator. This allows us to deal with a more general setup where the covariance matrix 
is not necessarily sparse but can be well approximated by a sparse matrix. 

Below we denote by $\|A\|$ the Frobenius norm of a matrix $A$: 

$$\|A\|^2 = \mathrm{Tr}(AA^\top) = \sum_{i,j} a_{ij}^2,$$

where $\mathrm{Tr}(B)$ stands for the trace of a square matrix $B$. Moreover, for $q > 0$, we denote by $|v|_q$ 
the $\ell_q$-norm of a vector $v$ and by $|A|_q$ the $\ell_q$ norm of the off-diagonal entries of $A$. We set 
$|A|_0 = \sum_{i \ne j} I(a_{ij} \ne 0)$ (the number of non-zero off-diagonal entries of $A$). The operator $\ell_q \to \ell_q$ 
norm of $A$ is denoted by $\|A\|_q$. 
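
As an illustration of this notation, the following minimal sketch computes the four quantities for a small symmetric matrix stored as a list of rows. The function names and the example matrix are hypothetical and chosen only for this illustration; only the case $q = 1$ of the operator norm (largest absolute column sum) is shown.

```python
import math


def frobenius_norm(a):
    """Frobenius norm ||A||: square root of the sum of all squared entries."""
    return math.sqrt(sum(x * x for row in a for x in row))


def offdiag_lq_norm(a, q):
    """|A|_q for q > 0: l_q norm of the off-diagonal entries of a square matrix."""
    p = len(a)
    total = sum(abs(a[i][j]) ** q for i in range(p) for j in range(p) if i != j)
    return total ** (1.0 / q)


def offdiag_l0(a):
    """|A|_0: number of non-zero off-diagonal entries."""
    p = len(a)
    return sum(1 for i in range(p) for j in range(p) if i != j and a[i][j] != 0)


def operator_l1_norm(a):
    """||A||_1: operator l_1 -> l_1 norm, i.e. the largest absolute column sum."""
    p = len(a)
    return max(sum(abs(a[i][j]) for i in range(p)) for j in range(p))


# Hypothetical 3 x 3 example with unit diagonal.
example = [[1.0, 0.5, 0.0],
           [0.5, 1.0, -0.25],
           [0.0, -0.25, 1.0]]

print(frobenius_norm(example) ** 2)  # squared Frobenius norm
print(offdiag_lq_norm(example, 1))   # l_1 norm of the off-diagonal entries
print(offdiag_l0(example))           # number of non-zero off-diagonal entries
print(operator_l1_norm(example))     # largest absolute column sum
```

Note that $|A|_q$ ignores the diagonal while $\|A\|$ and $\|A\|_1$ do not, which is why the diagonal entries are fixed to 1 in the matrix classes considered later.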

Frobenius norm and Sparsity. The cone of positive semi-definite (PSD) matrices can 
be equipped with a variety of norms, even more so than a vector space. [CZ] choose the $\|\cdot\|_1$ 
norm and consider classes of matrices that are essentially adapted to this metric. For example, 
the class $\mathcal{G}_q$ defined in (1) controls the largest $\ell_q$ norm of the columns of the covariance matrix 
$\Sigma$ with $0 \le q < 1$ while the $\|\cdot\|_1$ norm measures the largest $\ell_1$ norm of the columns of $\hat\Sigma - \Sigma$. 
Theorem 1 below indicates that for $q = 1$ consistent estimators do not exist. 

One may wonder whether faster rates can be obtained if, for example, $\Sigma$ has one row/column 
with large $\ell_q$ norm and all other rows/columns have small $\ell_q$ norm. It is quite clear that the 
$\|\cdot\|_1$ norm fails to capture such a behavior and we need to resort to other norms. As we see 
below, this is achievable when the Frobenius norm is used. 

The Frobenius norm is a rather weak norm on the PSD cone. Indeed, it is essentially a vector 
norm, unlike the $\|\cdot\|_1$ norm used by [CZ] or the spectral norm, which are operator norms. 
The choice of a norm is rather subjective, but some general guidelines exist in a given statistical 
setup. It can be motivated by the idea of minimizing the Kullback-Leibler divergence between 
the true distribution and its estimator (see, e.g., Rigollet, 2012). This principle naturally gives 
rise to the use of the Frobenius norm in Gaussian covariance matrix estimation, as indicated by 
the following lemma. 

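Lemma 1. Let $I_p$ be the $p \times p$ identity matrix and $A$ be a symmetric $p \times p$ matrix such that 
$I_p + A$ is PSD. Denote by $P_\Sigma$ the distribution of $\mathcal{N}_p(0, \Sigma)$ (a zero-mean normal random variable 
in $\mathbb{R}^p$ with covariance matrix $\Sigma > 0$). Then, for any $0 < \varepsilon < 1$, the Kullback-Leibler divergence 
between $P_{I_p + \varepsilon A}$ and $P_{I_p}$ satisfies 

$$\mathrm{KL}(P_{I_p + \varepsilon A}, P_{I_p}) \le \frac{\varepsilon^2 g(-\varepsilon)}{2}\, \|A\|^2,$$

where 

$$g(\varepsilon) = \frac{\varepsilon - \log(1 + \varepsilon)}{\varepsilon^2}.$$

Moreover, if $\|A\|_2 \le 1$, we have 

$$\mathrm{KL}(P_{I_p + \varepsilon A}, P_{I_p}) \ge \frac{(1 - \log 2)\,\varepsilon^2}{2}\, \|A\|^2. \qquad (1)$$
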



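Proof. Take $\Sigma = I_p + \varepsilon A$ and observe that 

$$\mathrm{KL}(P_\Sigma, P_{I_p}) = \mathbb{E} \log\Big(\frac{dP_\Sigma}{dP_{I_p}}(X)\Big) = \frac{1}{2} \log\Big(\frac{1}{\det(\Sigma)}\Big) + \frac{1}{2}\, \mathbb{E}\big[X^\top X - X^\top \Sigma^{-1} X\big],$$

where $X \sim \mathcal{N}_p(0, \Sigma)$. Let $\lambda_1, \dots, \lambda_p$ denote the eigenvalues of $A$ and recall that 
$\det(\Sigma) = \prod_{j=1}^p (1 + \varepsilon\lambda_j)$. Moreover, 

$$\mathbb{E}[X^\top X - X^\top \Sigma^{-1} X] = \mathrm{Tr}\big(\mathbb{E}[XX^\top] - \Sigma^{-1}\mathbb{E}[XX^\top]\big) = \mathrm{Tr}(\Sigma - I_p) = \varepsilon \sum_{j=1}^p \lambda_j.$$

Therefore, 

$$\mathrm{KL}(P_\Sigma, P_{I_p}) = \frac{1}{2} \sum_{j=1}^p \big[\varepsilon\lambda_j - \log(1 + \varepsilon\lambda_j)\big] = \frac{\varepsilon^2}{2} \sum_{j=1}^p g(\varepsilon\lambda_j)\, \lambda_j^2.$$

Note now that since $I_p + A$ is PSD, $\lambda_j \ge -1$ for all $j = 1, \dots, p$. Therefore, since $g$ is 
monotone decreasing on $(-1, \infty)$, we get $g(\varepsilon\lambda_j) \le g(-\varepsilon)$, and the upper bound follows from 
$\sum_j \lambda_j^2 = \|A\|^2$. The second statement of the lemma follows by observing that if $\|A\|_2 \le 1$, 
then $\varepsilon\lambda_j \le \varepsilon < 1$ for all $j = 1, \dots, p$, so that $g(\varepsilon\lambda_j) \ge g(1) = 1 - \log 2$. □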
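
The two bounds of Lemma 1 can be checked numerically from the eigenvalues of the perturbation matrix alone, since both the divergence and the squared Frobenius norm of a symmetric matrix depend only on its spectrum. The sketch below is an illustration under that assumption; the function names, the eigenvalues and the value of eps are hypothetical choices, not taken from the text.

```python
import math


def g(x):
    """g(x) = (x - log(1 + x)) / x^2, defined for x > -1, x != 0."""
    return (x - math.log1p(x)) / (x * x)


def kl_gaussian(eigs, eps):
    """KL(P_{I + eps*A}, P_I) written in terms of the eigenvalues of A."""
    return 0.5 * sum(eps * lam - math.log1p(eps * lam) for lam in eigs)


def lemma1_bounds(eigs, eps):
    """Lower and upper bounds of Lemma 1 (lower bound needs max |eig| <= 1)."""
    frob_sq = sum(lam * lam for lam in eigs)  # ||A||^2 for a symmetric A
    upper = 0.5 * eps ** 2 * g(-eps) * frob_sq
    lower = 0.5 * (1.0 - math.log(2.0)) * eps ** 2 * frob_sq
    return lower, upper


# Hypothetical spectrum: all eigenvalues in [-1, 1], so I + A is PSD
# and the operator l_2 norm of A is at most 1.
eigs = [-1.0, 0.5, 1.0]
eps = 0.5

lower, upper = lemma1_bounds(eigs, eps)
kl = kl_gaussian(eigs, eps)
print(lower, kl, upper)
```

For this spectrum the divergence lies between the two bounds, both of which are proportional to the squared Frobenius norm of the perturbation.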

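Minimax lower bounds over classes of sparse matrices. We denote by $\sigma_{ij}$ the elements 
of $\Sigma$ and by $\sigma_{(j)}$ the $j$th column of $\Sigma$ with its $j$th component replaced by 0. For any $q > 0$, $R > 0$, 
we define the following classes of matrices: 

$$\mathcal{G}_q^{(0)}(R) = \big\{\Sigma \in \mathcal{C}_{>0} : |\Sigma|_q^q \le R,\ \sigma_{ii} = 1,\ \forall i\big\},$$

$$\mathcal{G}_q^{(1)}(R) = \big\{\Sigma \in \mathcal{C}_{>0} : \max_{1 \le j \le p} |\sigma_{(j)}|_q^q \le R,\ \sigma_{ii} = 1,\ \forall i\big\},$$

where $\mathcal{C}_{>0}$ is the set of all positive definite symmetric $p \times p$ matrices. For $q = 0$, we 
define the classes $\mathcal{G}_0^{(0)}(R)$ and $\mathcal{G}_0^{(1)}(R)$ analogously, with the respective constraints $|\Sigma|_0 \le R$ and 
$\max_{1 \le j \le p} |\sigma_{(j)}|_0 \le R$. Here $R$ is an integer for the class $\mathcal{G}_0^{(1)}(R)$, and an even integer for $\mathcal{G}_0^{(0)}(R)$ 
in view of the symmetry. We assume that $R = 2k \le p(p - 1)$ for $\mathcal{G}_0^{(0)}(R)$ and $R = k \le p - 1$ for 
$\mathcal{G}_0^{(1)}(R)$ where $k$ is an integer. Set 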



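$$\psi^{(0)} = R^{1/2} \Big(\frac{1}{n} \log\Big(1 + \frac{c_0\, p^2}{R\, n^{q/2}}\Big)\Big)^{\frac{1}{2} - \frac{q}{4}}, \qquad \psi^{(1)} = R\, \Big(\frac{1}{n} \log\Big(1 + \frac{c_0\, p}{R\, n^{q/2}}\Big)\Big)^{\frac{1-q}{2}},$$

for some positive constant $c_0$ that does not depend on the parameters $p$, $n$, $R$. The following 
minimax lower bounds hold. 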

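Theorem 1. Fix $R > 0$, $0 \le q < 2$, $C_0 > 0$ and integers $n \ge 1$, $p \ge 2$. Consider the conditions 

$$R\Big(\frac{\log p}{n}\Big)^{1 - q/2} \le C_0, \qquad R\Big(\frac{\log p}{n}\Big)^{(1-q)/2} \le C_0, \qquad R^{-1}\Big(\frac{\log p}{n}\Big)^{q/2} \le C_0. \qquad (2)$$

Let $X_1, \dots, X_n$ be i.i.d. $\mathcal{N}_p(0, \Sigma)$ random vectors, and let $w : [0, \infty) \to [0, \infty)$ be a monotone 
non-decreasing function such that $w(0) = 0$ and $w \not\equiv 0$. Then there exist constants $c_0 > 0$, $c_1 > 0$, 
$c > 0$ depending only on $C_0$ such that, under the first and third conditions in (2), 

$$\inf_{\hat\Sigma} \sup_{\Sigma \in \mathcal{G}_q^{(0)}(R)} \mathbb{E}_\Sigma\, w\big(\|\hat\Sigma - \Sigma\| / (c_1 \psi^{(0)})\big) \ge c, \qquad (3)$$

and under the second and third conditions in (2), 

$$\inf_{\hat\Sigma} \sup_{\Sigma \in \mathcal{G}_q^{(1)}(R)} \mathbb{E}_\Sigma\, w\big(\|\hat\Sigma - \Sigma\|_1 / (c_1 \psi^{(1)})\big) \ge c, \qquad \forall\ 0 \le q < 1, \qquad (4)$$

where $\mathbb{E}_\Sigma$ denotes the expectation with respect to the joint distribution of $X_1, \dots, X_n$ and the 
infimum is taken over all estimators based on $X_1, \dots, X_n$. 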

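Proof. We first prove (3) with $q = 0$ and $R = 2k$. Assume first that $k \le p^2/16$. We use 
Theorem 2.7 in Tsybakov (2009). It is enough to check that there exists a finite subset $\mathcal{N}$ of 
$\mathcal{G}_0^{(0)}(2k)$ such that, for some constant $C > 0$ and some $\psi \ge C\psi^{(0)}$, we have 

(i) $\|\Sigma - \Sigma'\| \ge \psi$, $\forall\ \Sigma \ne \Sigma' \in \mathcal{N} \cup \{I_p\}$, 

(ii) $n\, \mathrm{KL}(P_\Sigma, P_{I_p}) \le 2^{-4} \log(\mathrm{card}\, \mathcal{N})$, $\forall\ \Sigma \in \mathcal{N}$. 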

We show that these conditions hold for 

$$\psi = \Big(\frac{k}{n} \log\Big(1 + \frac{e p}{2\sqrt{k}}\Big)\Big)^{1/2}.$$



Let $\mathcal{B}$ be the family of all $p \times p$ symmetric binary matrices, banded such that for all $B \in \mathcal{B}$, 
$b_{ij} = 0$ if $|i - j| > \sqrt{k}$, with 0 on the diagonal and exactly $k$ nonzero over-diagonal entries equal 
to 1. Let $M$ be the number of elements in the over-diagonal band where the entry 1 can only 
appear. For $k \le p^2/4$ we have $M \ge p\sqrt{k} - k \ge p\sqrt{k}/2$. Therefore, for $k \le p\sqrt{k}/4$, Lemma A.3 
in Rigollet and Tsybakov (2011) implies that there exists a subset $\mathcal{B}_0$ of $\mathcal{B}$ such that for any 
$B, B' \in \mathcal{B}_0$, $B \ne B'$, we have $\|B - B'\|^2 \ge (k + 1)/4$, and 

$$\log(\mathrm{card}\, \mathcal{B}_0) \ge C_1 k \log\Big(1 + \frac{e p}{2\sqrt{k}}\Big) \qquad (5)$$

for some absolute constant $C_1 > 0$. Consider the family of matrices $\mathcal{N} = \{\Sigma = I_p + \frac{a}{2} B : B \in \mathcal{B}_0\}$, 
where 

$$a = a_0 \Big(\frac{1}{n} \log\Big(1 + \frac{e p}{2\sqrt{k}}\Big)\Big)^{1/2}$$



for some $a_0 > 0$. All matrices in $\mathcal{N}$ have at most $2\sqrt{k}$ nonzero off-diagonal elements, each equal to $a/2$, in each row. 
Therefore, the first inequality in (2) guarantees that for $a_0$ small enough, matrices $I_p + aB$ with 
$B \in \mathcal{B}_0$ and, a fortiori, $\Sigma \in \mathcal{N}$ are diagonally dominant and hence PSD. Thus, $\mathcal{N} \subset \mathcal{G}_0^{(0)}(2k)$ for 
sufficiently small $a_0 > 0$. Also, for any $\Sigma, \Sigma' \in \mathcal{N}$, $\Sigma \ne \Sigma'$, we have 

$$\|\Sigma - \Sigma'\|^2 \ge C_2\, a^2 k$$

for some absolute constant $C_2 > 0$. It is easy to see that this inequality also holds with a 
different $C_2$ if $\Sigma$ or $\Sigma'$ is equal to $I_p$. The above display implies (i). To check (ii), observe first 
that since $I_p + aB$ is PSD, we can apply Lemma 1 with $A = aB$, $\varepsilon = 1/2$, to get 

$$n\, \mathrm{KL}(P_\Sigma, P_{I_p}) \le \frac{n a^2}{8}\, g(-1/2)\, \|B\|^2 \le a^2 k n, \qquad \forall\ \Sigma \in \mathcal{N}.$$

To prove (ii), it suffices to take $a_0^2 \le 2^{-4} C_1$, and to use (5). This proves (3) with $q = 0$ under 
the assumption $k \le p^2/16$. The case $q = 0$, $k > p^2/16$ corresponds to a rate $\psi^{(0)}$ of order $p/\sqrt{n}$ 
and is easily treated via the Varshamov-Gilbert argument (we omit the details). 

Next, observe that (3) for $0 < q < 2$ follows from the case $q = 0$. Indeed, let $k$ be the 
maximal integer such that $2k a^q \le R$ (we assume $a_0$ small enough to have $k \ge 1$, cf. the third 
inequality in (2)). Hence, $|\Sigma|_q^q \le 2k a^q \le R$ for any $\Sigma \in \mathcal{N}$. Also, $a\sqrt{k} \le R^{1/2} a^{1 - q/2}/\sqrt{2}$ and 
thus the first inequality in (2) ensures the positive definiteness of all $\Sigma \in \mathcal{N}$ for small $a_0$. For 
this choice of $k$, we have $k + 1 > R a^{-q}/2$ and $k \le C_3 R n^{q/2}$ with some constant $C_3 > 0$. It can 
be easily shown that (i) holds with 

$$\psi \asymp \Big(\frac{R a^{-q}}{n} \log\Big(1 + \frac{e p}{2\sqrt{k}}\Big)\Big)^{1/2} \asymp R^{1/2} a^{1 - q/2}.$$



The proof of (4) is quite analogous, with the only difference that $\mathcal{B}$ is now defined as the 
set of all symmetric binary matrices with exactly $k$ off-diagonal entries equal to 1 in the first 
row and in the first column and all other entries 0. Then, for $k \le (p - 1)/2$, Lemma A.3 
in Rigollet and Tsybakov (2011) implies that there exists a subset $\mathcal{B}_1$ of $\mathcal{B}$ such that for any two 
distinct $B, B' \in \mathcal{B}_1$, we have $|b_{(1)} - b'_{(1)}|_1 \ge (k + 1)/4$ (consequently, $\|B - B'\|_1 \ge (k + 1)/4$) and 

$$\log(\mathrm{card}\, \mathcal{B}_1) \ge C_1 k \log\Big(1 + \frac{e(p - 1)}{k}\Big). \qquad (6)$$

Here, $b_{(1)}$, $b'_{(1)}$ are the first columns of $B$, $B'$ with their first components replaced by 0. Thus, for 
any two distinct matrices $\Sigma$ and $\Sigma'$ belonging to the family $\mathcal{N}' = \{\Sigma = I_p + \frac{a}{2} B : B \in \mathcal{B}_1\}$ we 
have $\|\Sigma - \Sigma'\|_1^2 \ge C_4\, a^2 k^2$ for some constant $C_4 > 0$. Here, $\mathcal{N}' \subset \mathcal{G}_0^{(1)}(k)$ thanks to the second 
inequality in (2). Also, by Lemma 1, $\mathrm{KL}(P_\Sigma, P_{I_p}) \le a^2 k$ for all $\Sigma \in \mathcal{N}'$. These remarks and (6) 
imply the suitably modified (i) and (ii) for the choice 

$$a = a_0 \Big(\frac{1}{n} \log\Big(1 + \frac{e(p - 1)}{k}\Big)\Big)^{1/2}$$



with $a_0$ small enough. The rest of the proof follows the same lines as the proof of (3). □

The lower bound (4) and Theorem 4 in [CZ] imply that the rate $R\,((\log p)/n)^{(1-q)/2}$ is optimal 
on the class $\mathcal{G}_q^{(1)}(R)$ under the $\|\cdot\|_1$ norm if $R n^{q/2} \le p^{\alpha}$ with some $\alpha < 1$. In particular, for 
$q = 0$ this optimality holds under the quite natural condition $k = O(p^{\alpha})$, and no lower bound 
on $p$ in terms of $n$ is required. Clearly, this is also true when we drop the condition $\Sigma > 0$ in 
the definition of $\mathcal{G}_q^{(1)}(R)$ and consider a weak $\ell_q$ constraint as in [CZ]. 

Note that the rate $\psi^{(0)}$ is very similar to the optimal rate in the Gaussian sequence model, 
cf. Section 11.5 in Johnstone (2011). This is due to the similarity between the vector $\ell_2$ norm 
and the Frobenius norm. The rate $\psi^{(1)}$ is different but nevertheless has analogous ingredients. 
Observe also that, in contrast to the remark after Theorem 1 in [CZ], we prove the Frobenius and 
the $\|\cdot\|_1$-norm lower bounds (3) and (4) by exactly the same technique. The key point is the use 
of the "$k$-selection lemma" (Lemma A.3 in Rigollet and Tsybakov (2011)). The lower bound (4) 
improves upon Theorem 2 in [CZ] in two aspects. First, it does not need the assumption $p \ge n^{\nu}$, 
$\nu > 1$, and provides insight on the presumed optimal rate for any configuration of $n$, $p$, $R$. 
Second, it is established for general loss functions $w$, in particular for the "in probability" loss 
that we consider below. The technique used in Theorem 2 of [CZ] is not adapted for this purpose 
as it applies to special losses derived from $w(t) = t$. 

Approximate sparsity and optimal rates. Along with the hard thresholding estimator 
considered by [CZ], one can use the soft thresholding estimator $\hat\Sigma$ defined as the matrix with 
off-diagonal elements 

$$\hat\sigma_{ij} = \mathrm{sign}(\sigma^*_{ij})\,(|\sigma^*_{ij}| - \tau)_+,$$

where $\sigma^*_{ij}$ are the elements of the sample covariance matrix $\Sigma^*$, $\tau > 0$ is a threshold, and $(\cdot)_+$ 
denotes the positive part. The diagonal elements of $\hat\Sigma$ are all set to 1 since we consider the classes 
$\mathcal{G}_q^{(j)}(R)$, $j = 0, 1$. Then $\hat\Sigma = I_p + \hat\Sigma_{\mathrm{off}}$ where $\hat\Sigma_{\mathrm{off}}$ admits the representation (the minimum is 
taken over all $p \times p$ matrices $S$ with zero diagonal): 

$$\hat\Sigma_{\mathrm{off}} = \mathop{\mathrm{argmin}}_{S :\, \mathrm{diag}(S) = 0} \big\{|S - \Sigma^*|_2^2 + 2\tau |S|_1\big\}.$$



Take the threshold 

$$\tau = A\gamma \sqrt{\frac{\log p}{n}}, \qquad (7)$$

where $A > 1$ and $\gamma$ is the constant in the inequality (24) in [CZ]. 
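
The soft thresholding estimator described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names, the example sample covariance matrix and the numerical threshold are hypothetical, and the threshold is passed in directly rather than computed from the constants of [CZ].

```python
def soft_threshold(x, tau):
    """sign(x) * (|x| - tau)_+ for a scalar x and a threshold tau > 0."""
    if x > tau:
        return x - tau
    if x < -tau:
        return x + tau
    return 0.0


def soft_threshold_covariance(sample_cov, tau):
    """Soft-threshold the off-diagonal entries and set the diagonal to 1."""
    p = len(sample_cov)
    return [[1.0 if i == j else soft_threshold(sample_cov[i][j], tau)
             for j in range(p)] for i in range(p)]


def penalized_objective(s, y, tau):
    """One coordinate of the criterion |S - Sigma*|_2^2 + 2 * tau * |S|_1."""
    return (s - y) ** 2 + 2.0 * tau * abs(s)


# Hypothetical sample covariance matrix and threshold.
sample_cov = [[1.1, 0.4, -0.05],
              [0.4, 0.9, 0.2],
              [-0.05, 0.2, 1.0]]
tau = 0.1

estimate = soft_threshold_covariance(sample_cov, tau)
for row in estimate:
    print(row)
```

Because the penalized criterion separates across off-diagonal entries, each entry of the minimizer is the scalar soft thresholding of the corresponding sample covariance; `penalized_objective` can be used to confirm this on a grid of candidate values.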

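Theorem 2. Let $X_1, \dots, X_n$ be i.i.d. random vectors in $\mathbb{R}^p$ with covariance matrix $\Sigma$ such 
that (24) in [CZ] holds. Assume that $p$, $n$, and $A$ are such that $\tau \le \delta$, where $\delta$ is the constant 
introduced after (24) in [CZ]. Then there exists $C_* > 0$ such that, with probability at least 
$1 - C_* p^{2 - 2A^2}$, 

$$|\hat\Sigma - \Sigma|_2^2 \le \min_S \Big\{ |S - \Sigma|_2^2 + \Big(\frac{1 + \sqrt{2}}{2}\Big)^2 A^2\gamma^2\, \frac{|S|_0 \log p}{n} \Big\}, \qquad (8)$$

where $\min_S$ denotes the minimum over all $p \times p$ matrices. 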

Proof. Write $\sigma^*_{ij} = \sigma_{ij} + \xi_{ij}$ where the $\xi_{ij} = \sigma^*_{ij} - \sigma_{ij}$ are zero-mean random variables, $i \ne j$. 
Thus, considering $\sigma^*_{ij}$ as observations, we have a sequence model in dimension $p(p - 1)$. It is easy 
to see that it is a special case of the trace regression model studied in Koltchinskii et al. (2011) 
where $A_0$ is a diagonal matrix with the $p(p - 1)$ off-diagonal entries of $\Sigma$ on the diagonal. In 
the notation of Koltchinskii et al. (2011), the corresponding matrices $X_i$ are diagonalizations of 
canonical basis vectors, the norm $\|\cdot\|_{L_2(\Pi)}$ coincides with the norm $|\cdot|_2$, and $\mathrm{rank}(B)$ is equal to 
the number of non-zero entries of a diagonal matrix $B$. Thus, Assumption 1 in Koltchinskii et al. 
(2011) is satisfied with $\mu = 1$, and we can apply Theorem 1 in Koltchinskii et al. (2011). It 
yields a deterministic statement: 

$$|\hat\Sigma - \Sigma|_2^2 \le \min_S \Big\{ |S - \Sigma|_2^2 + \Big(\frac{1 + \sqrt{2}}{2}\Big)^2 \tau^2 |S|_0 \Big\}$$

provided $\tau \ge 2\max_{i \ne j} |\sigma^*_{ij} - \sigma_{ij}|$. From (24) in [CZ] and a union bound, we obtain that, for $\tau$ 
defined in (7), this inequality holds with probability greater than $1 - C_* p^{2 - 2A^2}$. □

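Corollary 1. Under the assumptions of Theorem 2, for any $0 < q < 2$, there exist constants 
$C', C_* > 0$ such that with probability at least $1 - C_* p^{2 - 2A^2}$, 

$$|\hat\Sigma - \Sigma|_2^2 \le \min_S \Big\{ 2|S - \Sigma|_2^2 + C' |S|_q^q \Big(\frac{\log p}{n}\Big)^{1 - q/2} \Big\}. \qquad (9)$$

Proof. Let $|s_{[l]}|$, $l = 1, \dots, p(p - 1)$, denote the absolute values of the off-diagonal elements 
of $S$ ordered in a decreasing order. Note that for any $p \times p$ matrix $S$ and any $0 < q < 2$ we 
have $|s_{[l]}|^q \le |S|_q^q / l$. Fix an integer $k \le p(p - 1)$. Taking $s'_{ij} = s_{ij}$ if $|s_{ij}| \ge |s_{[k]}|$ and $s'_{ij} = 0$ 
otherwise, we get that for any $S$ there exists a $p \times p$ matrix $S'$ with $|S'|_0 = k$ such that 

$$|S - S'|_2^2 = \sum_{l > k} |s_{[l]}|^2 \le |S|_q^2 \sum_{l > k} l^{-2/q} \le \frac{|S|_q^2\, k^{1 - 2/q}}{2/q - 1}.$$

Together with Theorem 2 and the inequality $|S' - \Sigma|_2^2 \le 2|S - \Sigma|_2^2 + 2|S - S'|_2^2$, this implies that for any integer $k \le p(p - 1)$ we have 

$$|\hat\Sigma - \Sigma|_2^2 \le \min_S \Big\{ 2|S - \Sigma|_2^2 + \frac{2q}{2 - q}\, |S|_q^2\, k^{1 - 2/q} \Big\} + \Big(\frac{1 + \sqrt{2}}{2}\Big)^2 A^2\gamma^2\, \frac{k \log p}{n}.$$

Optimizing the right hand side over $k$ completes the proof. □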
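
The approximation step in this proof, namely that keeping the $k$ largest off-diagonal entries leaves a squared $\ell_2$ tail controlled by the $\ell_q$ norm, can be checked numerically. The sketch below works directly on a vector of off-diagonal entries; the function name, the random test vector and the constant $q/(2-q)$ (obtained here by comparing the sum over $l > k$ of $l^{-2/q}$ with an integral) are assumptions made for this illustration.

```python
import random


def tail_energy_and_bound(entries, k, q):
    """Squared l_2 tail after keeping the k largest magnitudes, and its bound.

    The bound is |S|_q^2 * k^(1 - 2/q) * q / (2 - q), valid for 0 < q < 2, k >= 1.
    """
    mags = sorted((abs(x) for x in entries), reverse=True)
    tail = sum(m * m for m in mags[k:])
    lq_norm = sum(m ** q for m in mags) ** (1.0 / q)
    bound = lq_norm ** 2 * k ** (1.0 - 2.0 / q) * q / (2.0 - q)
    return tail, bound


# Hypothetical off-diagonal entries with decaying magnitudes.
random.seed(0)
entries = [random.gauss(0.0, 1.0) / (i + 1) for i in range(200)]

for k in (1, 5, 20, 100):
    tail, bound = tail_energy_and_bound(entries, k, q=1.0)
    print(k, tail, bound)
```

The bound decreases as a power of $k$, while the stochastic term in Theorem 2 grows linearly in $k$; balancing the two is the optimization over $k$ mentioned at the end of the proof.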

Note that the oracle inequalities (8) and (9) are satisfied for any covariance matrix $\Sigma$, not 
necessarily for sparse $\Sigma$. They quantify a trade-off between the approximation and sparsity 
terms. Their right-hand sides are small if $\Sigma$ is well approximated by a matrix $S$ with a small 
number of non-zero entries or with small $\ell_q$ norm of the off-diagonal elements. If the matrix $\Sigma$ is sparse, 
$\Sigma \in \mathcal{G}_q^{(0)}(R)$, the oracle inequalities (8) and (9) imply that 

$$\sup_{\Sigma \in \mathcal{G}_q^{(0)}(R)} P_\Sigma\Big( \|\hat\Sigma - \Sigma\| > C'' R^{1/2} \Big(\frac{\log p}{n}\Big)^{\frac{1}{2} - \frac{q}{4}} \Big) \le C_* p^{2 - 2A^2}$$

for some constant $C'' > 0$. This also holds when we drop the condition $\Sigma > 0$ in the definition of 
$\mathcal{G}_q^{(0)}(R)$. Combining this with Theorem 1, we find that the rate $R^{1/2} ((\log p)/n)^{1/2 - q/4}$ is optimal 
on the class $\mathcal{G}_q^{(0)}(R)$ under the Frobenius norm if $R n^{q/2} \le p^{2\alpha}$ with some $\alpha < 1$. In particular, 
for $q = 0$ this optimality holds under the condition $k \le p^{2\alpha}$ with some $\alpha < 1$. 